ABSTRACT

In  the  past  year  at  Carnegie  Mellon  steady  progress  has  been  made 
in  the  area  of  acoustic  and  language  modeling.  The  result  has  been 
a  dramatic  reduction  in  speech  recognition  errors  in  the  SPHINX-II 
system. In this paper, we review SPHINX-II and summarize our recent
efforts on improved speech recognition. Recently SPHINX-II
achieved the lowest error rate in the November 1992 DARPA evaluations.
For 5000-word, speaker-independent, continuous speech
recognition, the error rate was reduced to 5%.

1.  INTRODUCTION 

At Carnegie Mellon, we have made significant progress
in large-vocabulary speaker-independent continuous speech
recognition during the past years [16, 15, 3, 18, 14]. In comparison
with the SPHINX system [23], SPHINX-II offers not
only significantly fewer recognition errors but also the capability
to handle a much larger vocabulary size. For 5000-word
speaker-independent speech recognition, the recognition error
rate has been reduced to 5%. This system achieved the lowest
error rate among all of the systems tested in the November
1992 DARPA evaluations, where the testing set has 330 utterances
collected from 8 new speakers. Currently we are refining
and extending these and related technologies to develop
practical unlimited-vocabulary dictation systems, and spoken
language systems for general application domains with larger
vocabularies and reduced linguistic constraints.

One of the most important contributions to our system development
has been the availability of large amounts of training
data. In our current system, we used about 7200 utterances
of read Wall Street Journal (WSJ) text, collected from
84 speakers (half male and half female) for acoustic
model training, and 45 million words of text published
by the WSJ for language model training. In general, more
data requires different models so that more detailed acoustic-phonetic
phenomena can be well characterized. Towards
this end, our recent progress can be broadly classified into
feature extraction, detailed representation through parameter
sharing, search, and language modeling. Our specific contributions
in SPHINX-II include normalized feature representations,
multiple-codebook semi-continuous hidden Markov
models, between-word senones, multi-pass search algorithms,
long-distance language models, and unified acoustic and language
representations. The SPHINX-II system block diagram
is illustrated in Figure 1, where feature codebooks, dictionary,
senones, and language models are iteratively reestimated with
the semi-continuous hidden Markov model (SCHMM), although
not all of them are jointly optimized for the WSJ task at
present. In this paper, we will characterize our contributions
by percent error rate reduction. Most of these experiments
were performed on a development test set for the 5000-word
WSJ task. This set consists of 410 utterances from 10 new
speakers.

Figure 1: Sphinx-II System Diagram

2.  FEATURE  EXTRACTION 

The extraction of reliable features is one of the most important
issues in speech recognition, and as a result the training
data plays a key role in this research. However, the curse of
dimensionality reminds us that the amount of training data
will always be limited. Therefore incorporation of additional
features may not lead to any measurable error reduction. This
does not necessarily mean that the additional features are
poor ones, but rather that we may have insufficient data to
reliably model those features. Many systems that incorporate
environmentally-robust [1] and speaker-robust [11] models
face similar constraints.

2.1.  MFCC  Dynamic  Features 

Temporal changes in the spectra are believed to play an important
role in human perception. One way to capture this information
is to use delta coefficients that measure the change
in coefficients over time. Temporal information is particularly
suitable for HMMs, since HMMs assume each frame is
independent of the past, and these dynamic features broaden
the scope of a frame. In the past, the SPHINX system has
utilized three codebooks containing [23]: (1) 12 LPC cepstrum
coefficients $x_t(k)$, $1 \le k \le 12$; (2) 12 differenced
LPC cepstrum coefficients (40 msec difference) $\Delta x_t(k)$,
$1 \le k \le 12$; (3) power and differenced power (40 msec)
$x_t(0)$ and $\Delta x_t(0)$. Since we are using a multiple-codebook
hidden Markov model, it is easy to incorporate new features
by using an additional codebook. We experimented with a
number of new measures of spectral dynamics, including:
(1) second order differential cepstrum and power ($\Delta\Delta x_t(k)$,
$1 \le k \le 12$, and $\Delta\Delta x_t(0)$) and third order differential
cepstrum and power. The first set of coefficients is incorporated
into a new codebook, whose parameters are second
order differences of the cepstrum. The second order difference
for frame $t$, $\Delta\Delta x_t(k)$, where $t$ is in units of 10 msec,
is the difference between the $t+1$ and $t-1$ first order differential
coefficients, or $\Delta\Delta x_t(k) = \Delta x_{t-1}(k) - \Delta x_{t+1}(k)$.
Next, we incorporated both 40 msec and 80 msec differences,
which represent short-term and long-term spectral
dynamics, respectively. The 80 msec differenced cepstrum
$\Delta x'_t(k)$ is computed as $\Delta x'_t(k) = x_{t-4}(k) - x_{t+4}(k)$.
We believe that these two sources of information are more
complementary than redundant. We incorporated both $\Delta x_t$
and $\Delta x'_t$ into one codebook (combining the two into one
feature vector), weighted by their variances. We attempted
to compute an optimal linear combination of cepstral segments,
where the weights are computed from linear discriminants, but
we found that performance deteriorated slightly. This may
be due to limited training data, or there may be little information
beyond second-order differences. Finally, we compared
mel-frequency cepstral coefficients (MFCC) with our bilinear
transformed LPC cepstral coefficients. Here we observed a
significant improvement for the SCHMM model, but nothing
for the discrete model. This supported our early findings
about problems with modeling assumptions [15]. Thus, the final
configuration involves 51 features distributed among four
codebooks, each with 256 entries. The codebooks are: (1) 12
mel-scale cepstrum coefficients; (2) 12 40-msec differenced
MFCC and 12 80-msec differenced MFCC; (3) 12 second-order
differenced MFCC; and (4) power, 40-msec differenced
power, second-order differenced power. The new feature set
reduced errors by more than 25% over the baseline SPHINX
results on the WSJ task.
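
The four-codebook feature layout described above can be sketched in
Python as follows. This is a minimal illustration under stated
assumptions, not the SPHINX-II front end: the function names are
hypothetical, frames are assumed to be 10 msec apart, utterance edges
are handled by repeating the first and last frame, and the 40 msec
difference is taken as x[t-2] - x[t+2] by analogy with the 80 msec
formula in the text. The variance weighting of the combined difference
stream is omitted.

```python
import numpy as np

def diff(x, span):
    """Return x[t - span] - x[t + span] for every frame t (edge frames are repeated)."""
    padded = np.pad(x, ((span, span), (0, 0)), mode="edge")
    return padded[:-2 * span] - padded[2 * span:]

def dynamic_features(cepstra, power):
    """cepstra: (T, 12) array; power: (T,) array. Returns the four feature streams."""
    power = power.reshape(-1, 1)
    d40 = diff(cepstra, 2)    # 40 msec difference (assumed form)
    d80 = diff(cepstra, 4)    # 80 msec difference, x[t-4] - x[t+4]
    dd = diff(d40, 1)         # second-order difference, d40[t-1] - d40[t+1]
    dp40 = diff(power, 2)     # 40 msec differenced power
    ddp = diff(dp40, 1)       # second-order differenced power
    return [
        cepstra,                        # codebook 1: 12 cepstral coefficients
        np.hstack([d40, d80]),          # codebook 2: 12 + 12 differenced coefficients
        dd,                             # codebook 3: 12 second-order differences
        np.hstack([power, dp40, ddp]),  # codebook 4: 3 power terms
    ]
```

The stream widths are 12, 24, 12 and 3, which add up to the 51
features quoted in the text; each stream would then be quantized
against its own codebook.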


3.  DETAILED  MODELING  THROUGH 
PARAMETER  SHARING 

We need to model a wide range of acoustic-phonetic phenomena,
but this requires a large amount of training data. Since
the amount of available training data will always be finite, one
of the central issues becomes that of how to achieve the most
detailed modeling possible by means of parameter sharing.
Our successful examples include SCHMMs and senones.

3.1.  Semi-Continuous  HMMs 

The semi-continuous hidden Markov model (SCHMM) [12]
has provided us with an excellent tool for achieving detailed
modeling through parameter sharing. Intuitively, from the
continuous mixture HMM point of view, SCHMMs employ a
shared mixture of continuous output probability densities for
each individual HMM. Shared mixtures substantially reduce
the number of free parameters and the computational complexity
in comparison with the continuous mixture HMM, while
reasonably maintaining its modeling power. From the discrete
HMM point of view, SCHMMs integrate quantization
accuracy into the HMM, and robustly estimate the discrete
output probabilities by considering multiple codeword candidates
in the VQ procedure. The SCHMM mutually optimizes the VQ
codebook and HMM parameters under a unified probabilistic
framework [13], where each VQ codeword is regarded as a
continuous probability density function.
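
As an illustration of the shared-mixture view, the output probability
of a state can be written as a weighted sum of codeword densities,
restricted to the best few codewords for the current frame. The sketch
below assumes diagonal Gaussian codewords in a single codebook; the
function names and the default of four candidates (top_m) are
hypothetical choices for the example, not values taken from SPHINX-II.

```python
import numpy as np

def log_gauss_diag(x, means, variances):
    """Log density of x under each diagonal Gaussian codeword (one row per codeword)."""
    d = x.shape[0]
    return -0.5 * (d * np.log(2 * np.pi)
                   + np.sum(np.log(variances), axis=1)
                   + np.sum((x - means) ** 2 / variances, axis=1))

def schmm_output_prob(x, means, variances, state_weights, top_m=4):
    """Semi-continuous output probability of one state for frame x.

    means, variances: (K, D) shared codebook; state_weights: (K,) discrete
    output probabilities of the state, one per codeword.
    """
    log_dens = log_gauss_diag(x, means, variances)
    top = np.argsort(log_dens)[-top_m:]   # keep only the best codeword candidates
    return float(np.sum(state_weights[top] * np.exp(log_dens[top])))
```

Because the codebook is shared by all states, the densities are
computed once per frame and only the state-specific weights differ,
which is where the saving in free parameters comes from.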

For the SCHMM, an appropriate acoustic representation for
the diagonal Gaussian density function is crucial to the recognition
accuracy [13]. We first performed exploratory semi-continuous
experiments on our three-codebook system. The
SCHMM was extended to accommodate a multiple feature
front-end [13]. All codebook means and covariance matrices
were reestimated together with the HMM parameters except
the power covariance matrices, which were fixed. When three
codebooks were used, the diagonal SCHMM reduced the error
rate of the discrete HMM by 10-15% for the RM task [16].
When we used our improved 4-codebook MFCC front-end,
the error rate reduction was more than 20% over the discrete
HMM.

Another advantage of using the SCHMM is that it requires less
training data in comparison with the discrete HMM. Therefore,
given the current limitations on the size of the training
data set, more detailed models can be employed to improve
the recognition accuracy. One way to increase the number
of parameters is to use speaker-clustered models. Due to the
smoothing abilities of the SCHMM, we were able to train
multiple sets of models for different speakers. We investigated
automatic speaker clustering as well as explicit male,
female, and generic models. By using sex-dependent models
with the SCHMM, the error rate was further reduced by 10% on
the WSJ task.

3.2.  Senones

To share parameters among different word models, context-dependent
subword models have been used successfully in
many state-of-the-art speech recognition systems [26, 21, 17].
The principle of parameter sharing can also be extended to
subphonetic models [19, 18]. We treat the state in phonetic
hidden Markov models as the basic subphonetic unit,
the senone. Senones are constructed by clustering the state-dependent
output distributions across different phonetic models.
The total number of senones can be determined by clustering
all the triphone HMM states as the shared-distribution
models [18]. States of different phonetic models may thus
be tied to the same senone if they are close according to
the distance measure. Under the senonic modeling framework,
we could also use a senonic decision tree to predict unseen
triphones. This is particularly important for vocabulary
independence [10], as we need to find subword models which
are detailed, consistent, trainable and especially generalizable.
Recently we have developed a new senonic decision tree to
predict the subword units not covered in the training set [18].
The decision tree classifies senones by asking questions in a
hierarchical manner [7]. These questions were first created
using speech knowledge from human experts. The tree was
automatically constructed by searching for simple as well as
composite questions. Finally, the tree was pruned using cross
validation. When the algorithm terminated, the leaf nodes
of the tree represented the senones to be used. For the WSJ
task, our overall senone models gave us a 35% error reduction
in comparison with the baseline SPHINX results.
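
The bottom-up clustering of state output distributions can be
sketched as a greedy merge. The text does not specify the distance
measure, so the sketch below assumes one: the increase in
count-weighted entropy caused by pooling two discrete distributions.
The function names and the representation of each state as a vector
of codeword counts are also assumptions made for the example.

```python
import numpy as np

def entropy(counts):
    """Entropy (in nats) of the distribution obtained by normalizing the counts."""
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def merge_cost(a, b):
    """Increase in count-weighted entropy when two count vectors are pooled."""
    merged = a + b
    return (merged.sum() * entropy(merged)
            - a.sum() * entropy(a) - b.sum() * entropy(b))

def cluster_states(state_counts, num_senones):
    """Greedily merge HMM states until num_senones clusters remain.

    state_counts: {state_name: codeword count vector}. Returns a list of
    clusters, each a sorted list of the state names tied to one senone.
    """
    clusters = [([name], np.asarray(c, dtype=float))
                for name, c in state_counts.items()]
    while len(clusters) > num_senones:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = merge_cost(clusters[i][1], clusters[j][1])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        merged = (clusters[i][0] + clusters[j][0], clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [sorted(members) for members, _ in clusters]
```

Clustering at the state level means that two states with similar
count vectors are tied even when the other states of their models
stay apart.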

The advantages of senones include not only better parameter
sharing but also improved pronunciation optimization.
Clustering at the granularity of the state rather than the entire
model (like generalized triphones [21]) can keep the dissimilar
states of two models apart while the other corresponding
states are merged, and thus lead to better parameter sharing.
In addition, senones give us the freedom to use a larger
number of states for each phonetic model to provide more
detailed modeling. Although an increase in the number of
states will increase the total number of free parameters, with
senone sharing, redundant states can be clustered while others
are uniquely maintained.

Pronunciation Optimization. Here we use the forward-backward
algorithm to iteratively optimize a senone sequence
appropriate for modeling multiple utterances of a word. To
explore the idea, given the multiple examples, we train a word
HMM whose number of states is proportional to the average
duration. When the Baum-Welch reestimation reaches its
optimum, each estimated state is quantized with the senone
codebook. The closest senone is used to label each state of the
word HMM. This sequence of senones becomes the senonic
baseform of the word. Here arbitrary sequences of senones are
allowed to provide the flexibility for the automatically learned
pronunciation. When the senone sequence of every word is
determined, the parameters (senones) may be re-trained. Although
each word model generally has more states than the
traditional phoneme-concatenated word model, the number
of parameters remains the same since the size of the senone
codebook is unchanged. When senones were used for pronunciation
optimization in a preliminary experiment, we achieved a
10-15% error reduction in a speaker-independent continuous
spelling task [19].
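
The labeling step, in which each estimated state of the word HMM is
quantized with the senone codebook, can be sketched as a
nearest-neighbor lookup. The text says only that the closest senone
is used; taking the Kullback-Leibler divergence between discrete
output distributions as the distance is an assumption, as are the
function names and the dictionary layout of the codebook.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL divergence between two discrete distributions (eps guards against log(0))."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def senonic_baseform(state_dists, senone_codebook):
    """Label every state of a trained word HMM with its closest senone.

    state_dists: list of output distributions, one per word-HMM state.
    senone_codebook: {senone_name: output distribution}.
    """
    return [min(senone_codebook,
                key=lambda name: kl_divergence(dist, senone_codebook[name]))
            for dist in state_dists]
```

The returned sequence of senone names plays the role of the senonic
baseform; the same senone may appear several times or not at all,
which is what allows arbitrary learned pronunciations.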

4.  MULTI-PASS  SEARCH 

Recent work on search algorithms for continuous speech
recognition has focused on the problems related to large vocabularies,
long-distance language models and detailed acoustic
modeling. A variety of approaches based on Viterbi beam
search [28, 24] or stack decoding [5] form the basis for most
of this work. In comparison with stack decoding, Viterbi
beam search is more efficient but less optimal in the sense
of MAP. For stack decoding, a fast-match is necessary to reduce
a prohibitively large search space. A reliable fast-match
should make full use of detailed acoustic and language models
to avoid the introduction of possibly unrecoverable errors.
Recently, several systems have been proposed that use Viterbi
beam search as a fast-match [27, 29] for stack decoding or the
N-best paradigm [25]. In these systems, N-best hypotheses
are produced with very simple acoustic and language models.
A multi-pass rescoring is subsequently applied to these hypotheses
to produce the final recognition output. One problem
in this paradigm is that decisions made by the initial phase
are based on simplified models. This results in errors from which
the N-best hypothesis list cannot recover. Another problem
is that the rescoring procedure could be very expensive per
se, as many hypotheses may have to be rescored. The challenge
here is to design a search that makes the appropriate
compromises among memory bandwidth, memory size, and
computational power [3].

To meet this challenge we incrementally apply all available
acoustic and linguistic information in three search phases.
Phase one is a left-to-right Viterbi beam search which produces
word end times and scores using right-context between-word
models with a bigram language model. Phase two, guided
by the results from phase one, is a right-to-left Viterbi beam
search which produces word beginning times and scores based
on left-context between-word models. Phase three is an A*
search which combines the results of phases one and two with
a long-distance language model.

4.1.  Modified  A*  Stack  Search 

Each theory, th, on the stack consists of five entries: a partial
theory, th.pt; a one-word extension, th.w; a time, th.t, which
denotes the boundary between th.pt and th.w; and two scores,
th.g, which is the score for th.pt up to time th.t, and th.h, which
is the best score for the remaining portion of the input starting
with th.w at time th.t+1 through to the end. Unique theories
are determined by th.pt and th.w. The algorithm proceeds as
follows.

1. Add initial states to the stack.

2. According to the evaluation function th.g + th.h, remove
   the best theory, th, from the stack.

3. If th accounts for the entire input then output the sentence
   corresponding to th. Halt if this is the Nth utterance
   output.

4. For the word th.w consider all possible end times, t, as
   provided by the left/right lattice.

   (a) For all words, w, beginning at time t + 1 as provided
       by the right/left lattice:

       i. Extend theory th with w. Designate this
          theory as th'. Set th'.pt = th.pt + th.w,
          th'.w = w and th'.t = t.

       ii. Compute scores
           th'.g = th.g + w_score(w, th.t + 1, t), and
           th'.h. See the following for the definition of w_score
           and the th'.h computation.

       iii. If th' is already on the stack then choose the
            best instance of th', otherwise push th' onto
            the stack.

5. Go to step 2.
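
The steps above can be sketched in Python as follows. This is an
illustration under several assumptions rather than the SPHINX-II
implementation: scores are log probabilities (higher is better), the
two lattices are given as plain dictionaries, an end-of-input marker
"</s>" is introduced to detect complete theories, and the long-distance
language model score is left out. The w_score term is read here as the
score of th.w between th.t + 1 and t. All names are hypothetical.

```python
import heapq

END = "</s>"  # hypothetical end-of-input marker, not part of the original algorithm

def a_star_nbest(w_score, h_table, n_best, last_frame):
    """w_score: {(word, begin, end): score} from the left/right lattice.
    h_table: {(word, begin): best score for the rest of the input} from the
    right/left lattice. Returns up to n_best (word list, score) pairs.
    """
    ends = {}    # (word, begin) -> possible end times
    for (w, b, e) in w_score:
        ends.setdefault((w, b), []).append(e)
    starts = {}  # begin time -> words that may begin there
    for (w, b) in h_table:
        starts.setdefault(b, []).append(w)

    stack = []   # heap ordered by -(g + h), so the best theory is popped first
    best = {}    # (pt, w) -> best g + h seen so far; theories are unique per (pt, w)

    def push(pt, w, t, g, h):
        key = (pt, w)
        if key not in best or g + h > best[key]:
            best[key] = g + h
            heapq.heappush(stack, (-(g + h), pt, w, t, g, h))

    for w in starts.get(0, []):                       # step 1
        push((), w, -1, 0.0, h_table[(w, 0)])

    results = []
    while stack and len(results) < n_best:
        neg_f, pt, w, t, g, h = heapq.heappop(stack)  # step 2
        if -neg_f < best[(pt, w)]:
            continue                                  # superseded by a better instance
        if w == END:                                  # step 3
            results.append((list(pt), g))
            continue
        for e in ends.get((w, t + 1), []):            # step 4
            g2 = g + w_score[(w, t + 1, e)]
            nexts = [END] if e == last_frame else starts.get(e + 1, [])
            for w2 in nexts:                          # step 4(a)
                h2 = 0.0 if w2 == END else h_table[(w2, e + 1)]
                push(pt + (w,), w2, e, g2, h2)
    return results                                    # the loop is step 5
```

Keeping only the best instance per (pt, w) pair corresponds to
choosing the boundary time th'.t that maximizes th'.g + th'.h.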

4.2.  Discussion 

When th is extended we are considering all possible end times
t for th.w and all possible extensions w. When extending th
with w to obtain th' we are only interested in the value for
th'.t which gives the best value for th'.h + th'.g. For any t
and w, th'.h is easily determined via table lookup from the
right/left lattice. Furthermore the value of th'.g is given by
th.g + w_score(w, th.t+1, t). The function w_score(w, b, e)
computes the score for the word w with begin time b and end
time e.

Our objective is to maximize the recognition accuracy with
a minimal increase in computational complexity. With
our decomposed, incremental, semi-between-word-triphones
search, we observed that early use of detailed acoustic models
can significantly reduce the recognition error rate with
a negligible increase in computational complexity, as shown in
Figure 2.

By incrementally applying knowledge we have been able to
decompose the search so that we can efficiently apply detailed
acoustic or linguistic knowledge in each phase. Furthermore,
each phase defers decisions that are better made by a
subsequent phase that will apply the appropriate acoustic or
linguistic information.

Figure 2: Comparison between early and late use of knowledge.

5.  UNIFIED  STOCHASTIC  ENGINE 

Acoustic and language models are usually constructed separately,
where language models are derived from a large text
corpus without consideration for acoustic data, and acoustic
models are constructed from the acoustic data without exploiting
the existing text corpus used for language training.
We have recently developed a unified stochastic engine (USE)
that jointly optimizes both acoustic and language models. As
the true probability distributions of both the acoustic and language
models cannot be accurately estimated, they cannot be
considered as real probabilities but as scores from two different
sources. Since they are scores instead of probabilities, the
straightforward implementation of the Bayes equation will
generally not lead to a satisfactory recognition performance.
To integrate language and acoustic probabilities for decoding,
we are forced to weight acoustic and language probabilities
with a so-called language weight [6]. The constant language
weight is usually tuned to balance the acoustic probabilities
and the language probabilities such that the recognition error
rate can be minimized. Most HMM-based speech recognition
systems have one single constant language weight that is independent
of any specific acoustic or language information,
and that is determined using a hill-climbing procedure on development
data. It is often necessary to make many runs with
different language weights on the development data in order
to determine the best value.

In the unified stochastic engine (USE), we can iteratively
adjust not only the language probabilities to fit our given acoustic
representations but also the acoustic models. Our multi-pass
search algorithm generates N-best hypotheses which are used
to optimize language weights or implement many discriminative
training methods, where recognition errors can be used
as the objective function [20, 25]. With the progress of new
database construction such as DARPA's CSR Phase II, we believe
acoustically-driven language modeling will eventually
provide us with dramatic performance improvements.

In the N-best hypothesis list, we can assume that the correct
hypothesis is always in the list (we can insert the correct
answer if it is not there). Let a hypothesis be a sequence of
words $w_1, w_2, \ldots, w_k$ with corresponding language and acoustic
probabilities. We denote the correct word sequence as $\theta$, and
all the incorrect sentence hypotheses as $\bar{\theta}$. We can assign a
variable weight to each of the n-gram probabilities such that
we have a weighted language probability as:

$$\mathcal{W}(W) = \prod_i \Pr(w_i \mid w_{i-1} \ldots)^{\alpha(X_i, w_i w_{i-1} \ldots)} \qquad (1)$$

where the weight $\alpha()$ is a function of acoustic data, $X_i$, for
$w_i$, and words $w_i, w_{i-1}, \ldots$. For a given sentence $k$, a very
general objective function can be defined as

$$L_k(\Lambda) = \sum_{\bar{\theta}} \Pr(\bar{\theta}) \Big\{ -\sum_{i \in \theta} \big[ \log \Pr(X_i \mid w_i) + \alpha(X_i, w_i w_{i-1} \ldots) \log \Pr(w_i \mid w_{i-1} w_{i-2} \ldots) \big] + \sum_{i \in \bar{\theta}} \big[ \log \Pr(X_i \mid w_i) + \alpha(X_i, w_i w_{i-1} \ldots) \log \Pr(w_i \mid w_{i-1} \ldots) \big] \Big\} \qquad (2)$$

where $\Lambda$ denotes acoustic and language model parameters as
well as language weights, $\Pr(\bar{\theta})$ denotes the a priori probability
of the incorrect path $\bar{\theta}$, and $\Pr(X_i \mid w_i)$ denotes the acoustic
probability generated by word model $w_i$. It is obvious that
when $L_k(\Lambda) > 0$ we have a sentence classification error. Minimization
of Equation 2 will lead to minimization of the sentence
recognition error rate. To jointly optimize the whole training
set, we first define a nondecreasing, differentiable cost
function $l_k(\Lambda)$ (we use the sigmoid function here) in the same
manner as the adaptive probabilistic descent method [4, 20].
There exist many possible gradient descent procedures for the
proposed problems.
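
The objective in Equation 2 can be illustrated numerically. The sketch
below simplifies the weight to a single constant alpha rather than a
function of the acoustic data and word history, represents each
hypothesis as a pair of per-word acoustic and language log probability
lists, and uses hypothetical function names throughout.

```python
import math

def sentence_score(acoustic_logps, lm_logps, alpha):
    """Sum over words of log Pr(X_i | w_i) + alpha * log Pr(w_i | history)."""
    return sum(a + alpha * l for a, l in zip(acoustic_logps, lm_logps))

def misclassification(correct, incorrect, priors, alpha):
    """L_k: prior-weighted score of each incorrect hypothesis minus the correct score.

    correct: (acoustic_logps, lm_logps) for the correct word sequence.
    incorrect: list of such pairs; priors: matching a priori probabilities.
    """
    s_correct = sentence_score(correct[0], correct[1], alpha)
    return sum(p * (sentence_score(hyp[0], hyp[1], alpha) - s_correct)
               for hyp, p in zip(incorrect, priors))

def sigmoid_cost(l_k):
    """Smooth, nondecreasing cost of the misclassification measure."""
    return 1.0 / (1.0 + math.exp(-l_k))
```

A positive value of misclassification means that the incorrect
hypotheses outscore the correct one; because the sigmoid cost is
differentiable, alpha and the model parameters can be adjusted by
gradient descent to reduce it.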

The term $\alpha(X_i, w_i w_{i-1} \ldots) \log \Pr(w_i \mid w_{i-1} \ldots)$ could be
merged as one item in Equation 2. Thus we can have language
probabilities directly estimated from the acoustic training
data. The proposed approach is fundamentally different
from traditional stochastic language modeling. Firstly, conventional
language modeling uses a text corpus only. Any
acoustically confusable words will not be reflected in language
probabilities. Secondly, maximum likelihood estimation is
usually used, which is only loosely related to minimum sentence
error. The reason for us to keep $\alpha()$ separate from the
language probability is that we may not have sufficient acoustic
data to estimate the language parameters at present. Thus,
we are forced to have $\alpha()$ shared across different words so we
may have n-gram-dependent, word-dependent or even word-count-dependent
language weights. We can use the gradient
descent method to optimize all of the parameters in the USE.
When we jointly optimize $L(\Lambda)$, we not only obtain our unified
acoustic models but also the unified language models. A
preliminary experiment reduced the error rate by 5% on the WSJ
task [14]. We will extend the USE paradigm for joint acoustic
and language model optimization. We believe that the USE
can further reduce the error rate with an increased amount of
training data.

6.  LANGUAGE  MODELING 

Language modeling is used in Sphinx-II at two different
points. First, it is used to guide the beam search. For that
purpose we used a conventional backoff bigram.
Secondly, it is used to recalculate linguistic scores for
the top N hypotheses, as part of the N-best paradigm. We
concentrated most of our language modeling effort on the
latter.

Several variants of the conventional backoff trigram language
model were applied at the reordering stage of the N-best
paradigm. (Eventually we plan to incorporate this language
model into the A* phase of the multi-pass search with the
USE.) The best result, a 22% word error rate reduction, was
achieved with the simple, non-interpolated “backward” trigram,
with the conventional forward trigram finishing a close
second.

7.  SUMMARY 

Our contributions in SPHINX-II include improved feature
representations, multiple-codebook semi-continuous hidden
Markov models, between-word senones, multi-pass search
algorithms, and unified acoustic and language modeling. The
key to our success is our data-driven unified optimization approach.
This paper characterized our contributions by percent
error rate reduction on the 5000-word WSJ task, for which we
reduced the word error rate from 20% to 5% in the past year
[2].

Although we have made dramatic progress, there remains a
large gap between commercial applications and laboratory
systems. One problem is the large number of out-of-vocabulary
(OOV) words in real dictation applications. Even for a
20000-word dictation system, on average more than 25% of
the utterances in a test set contain OOV words. Even if we
exclude those utterances containing OOV words, the error rate
is still more than 9% for the 20000-word task due to the limitations
of current technology. Other problems are illustrated
by the November 1992 DARPA stress test evaluation, where
the testing data comprises both spontaneous speech with many
OOV words and speech recorded using several different
microphones. Even though we augmented our system with
more than 20,000 utterances in the training set and a noise
normalization component [1], our augmented system only reduced
the error rate of our 20000-word baseline result from
12.8% to 12.4%, and the error rate for the stress test was even
worse when compared with the baseline (18.0% vs. 12.4%).
To summarize, our current word error rates under different
testing conditions are listed in Table 1. We can see from this
table that improved modeling technology is still needed to
make speech recognition a reality.

Systems        Vocabulary    Test Set    Error Rate
Baseline       5000          330 utt.    5.3%
Baseline       20000         333 utt.    12.4%
Stress Test    20000         320 utt.    18.0%

Table 1: Performance of SPHINX-II in real applications.